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Abstract 



We design a new distribution over poly(re _1 ) x n matrices S so that for any fixed n x d matrix A 
of rank r, with probability at least 9/10, ||.5M.:e|| 2 = (1 ± e)||^4x|| 2 simultaneously for all x G R d . Such 
a matrix S is called a subspace embedding. Furthermore, 5*^4 can be computed in nnz(A) + poly(cfe ) 
time, where nnz(^l) is the number of non-zero entries of A. This improves over all previous subspace 
embeddings, which required at least Q(nd\ogd) time to achieve this property. We call our matrices S 
sparse embedding matrices. 

Using our sparse embedding matrices, we obtain the fastest known algorithms for overconstrained 
least-squares regression, low-rank approximation, and approximating all leverage scores. 

• to output an x' for which \\Ax' — b\\ 2 < (1 + e) min x \\Ax — b\\ 2 for an n x d matrix A and an n x 1 
column vector b, we obtain an algorithm running in 0(nnz(A)) +poly(de _1 ) time. 

• to obtain a decomposition of an n x n matrix A into a product of an n x k matrix L, a k x k 
diagonal matrix D, and a k x n matrix R, for which 



• to output an approximation to all leverage scores of an n x d input matrix A simultaneously, with 
constant relative error, our algorithms run in 0(nnz(A) logn) + poly(r logn) time. 

We optimize the polynomial factors in the above stated running times, and show various tradeoffs. We 
also provide preliminary experimental results which suggest that our algorithm is competitive in practice. 

1 Introduction 

A large body of work has been devoted to the study of fast randomized approximation algorithms for problems 
in numerical linear algebra. Several well-studied problems in this area include least squares regression, 
low rank approximation, and approximate computation of leverage scores. These problems have many 
applications in data mining [5] , recommendation systems [17] , information retrieval |37] , web search [TJ 132) , 
clustering [131134] . and learning mixtures of distributions |31l [2]. The use of randomization and approximation 
allows one to solve these problems much faster than deterministic methods. 

For example, in the overconstrained least-squares regression problem, we are given an n x d matrix A 
of rank r as input, n>d, together with an n x 1 column vector b. The goal is to output a vector x' so 
that with high probability, \Ax' — b\% < (1 + e) min x \\Ax — b\\z. The minimizing vector x* can be expressed 
in terms of the Moore-Penrose pseudoinverse A~ of A, namely, x* = A~b. If A has full column rank, this 
simplifies to x* = (A T A)~ 1 A T b. This minimizer can be computed deterministically in 0(nd 2 ) time, but 
with randomization and approximation, this problem can be solved in 0(ndlogd) + poly(d(lnn)/e) time 
[39l [24] , which is much faster for d <C n and e not too small. 

Another example is low rank approximation. Here we are given annxn matrix (which can be generalized 
to n x d) and an input parameter fc, and the goal is to find an n x n matrix A' of rank at most k for which 
\\A' — A\\p < (1 + e)\\A — Ak\\F> where for an n X n matrix B, \\B\\ F = Y^i=i S?=i ^ s tne squared 
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Frobenius norm, and Ak = argmin rankB<fe \\A — B\\p. Here Ak can be computed deterministically using 
the singular value decomposition in 0(r?) time. However, using randomization and approximation, this 
problem can be solved in 0{rmz{A) ■ (k/e + fclogfc) + n ■ poly(/c/e)) time[39l [8], where imz(A) denotes the 
number of non-zero entries of A. The problem can also be solved using randomization and approximation 
in 0(n 2 logn) + n ■ poly(A;/e) time |39j . which may be faster than the former for dense matrices and large k. 

Another problem we consider is approximating the leverage scores. Given an nx d matrix A with n» d, 
one can write A = UT,V T in its singular value decomposition, where the columns of U are the left singular 
vectors, E is a diagonal matrix, and the columns of V are the right singular vectors. Although U has 
orthonormal columns, not much can be immediately said about the squared lengths ||C^||! of its rows. These 
values are known as the leverage scores, and measure the extent to which the singular vectors of A are 
correlated with the standard basis. They are basis-independent, that is, the leverage scores of a matrix A 
are equal to the diagonal elements of the projection matrix onto the span of the columns of A; see [H] for 
background on leverage scores as well as a list of applications. The leverage scores will also play a crucial 
role in our work, as we shall see. The goal of approximating the leverage scores is to, simultaneously for 
each i £ [n], output a constant factor approximation to ||t/j|||. Using randomization, this can be solved in 
0(nd log n + d 3 log d logn) time [TO] , 

There are also solutions for these problems based on sampling. They either get a weaker additive error 
[23 EH El HH US [HI HOI EH! E], or they get bounded relative error but are slow [H CHI [22 [23] • Man y of 
the latter algorithms were improved independently by Deshpande and Vempala [12] and Sarlos [39], and in 
followup work [2H [Ml 133] • There are also solutions based on iterative and conjugate-gradient methods, see, 
e -g-> EH) or BU as a recent example. These methods repeatedly compute matrix- vector products Ax for 
various vectors x; in the most common setting, such products require 0(nnz(yl)) time. Thus the work per 
iteration of these methods is 0(nnz(A)), and the number of iterations N that are performed depends on the 
desired accuracy, spectral properties of A, numerical stability issues, and other concerns, and can be large. 
A recent survey suggests that N is typically 0(fc) for using Krylov methods (such as Arnoldi and Lanczos 
iterations) to approximate the k leading singular vectors 26 . One can also use some of these techniques 
together, for example by first obtaining a preconditioner using the Johnson-Lindenstrauss (JL) transform, 
and then running an iterative method. 

While these results illustrate the power of randomization and approximation, their main drawback is 
that they are not optimal. For example, for regression, ideally we could hope for 0(nnz(A)) + poly(d/e) 
time. While the 0(nd\ogd) + po\y(d/e) time algorithm for least squares regression is almost optimal for 
dense matrices, if nnz(A) <C nd, say nnz(A) = O(n), as commonly occurs, this could be much worse than an 
0(nnz(A)) + poly((i/e) time algorithm. Similarly, for low rank approximation, the best known algorithms 
that are condition-independent run in 0(nnz(A)(k/e + k log k) + n ■ poly(fc/e)) time, while we could hope for 
(9(nnz(A)) +poly(fc/e) time. 



1.1 Results 

We resolve the above gaps by achieving algorithms for least squares regression, low rank approximation, and 
approximate leverage scores, whose leading order term in the time complexity is 0(nnz(A)), with a constant 
factor that is independent of any properties of A. Our results are as follows: 

• Least Squares Regression: We present two different algorithms for annxd matrix A with rank 
r and given e > 0. The first has running time 0(imz(A) + d 5 e~ 4 \og 3 (d/e)). The second has running 
time 0(nnz(A) log(d/e) + r 3 e~ 2 (log(r /e))(log l/e)). 

• Low Rank Approximation: We achieve 0(nnz(A)) + n ■ poly(fc/e) time. More specifically, as 



stated by Theorem 31 for k small enough, we obtain 0(nnz(A) + nk e log (k/e)) time, which is 
0(rmz(A) + nk 4 log k) for fixed e, or 0(nnz(^4) \og 2 (k/e) + nke~ 8 \og(k/e)(k + log(l/e)) time, which 
is 0(nnz(A) log 2 k + nk 2 logfc) for fixed e. Here k < n 1 / 2 " 7 is small enough, for any fixed 7 > 0; we 
also have results for larger k. As described for Theorem [30j we also obtain an approximation to A of 
rank 0(k/e) in time 0(nnz(^4) +poly(fe/e)), which can be an improvement in running time even when 
nnz(A)/n is not large; the approximation is represented implicitly, as the product of three matrices. 
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• Approximate Leverage Scores: For any fixed constant e > 0, wc simultaneously approximate all n 
leverage scores in 0(nnz(A) log n + r 3 log 2 r + r 2 logn) time. This can be generalized to sub-constant 
e to achieve 0(nnz(A) log n) + poly(r/e) time, though in the applications we are aware of, such as 
coresets for regression [9], e is typically constant (in the applications of this, a general e > can be 
achieved by over-sampling [2TI [5]). 

1.2 Techniques 

All of our results are achieved by improving the time complexity of computing what is known as a subspace 
embedding. For a given n x d matrix A, call S : K. n <— > R a subspace embedding matrix for A if, for all 
x £ R d , \\SAx\\ 2 = (1 ± e)||AB|| 2 . That is, S embeds the column space C{A) = {Ax \ x £ R d } into K* while 
approximately preserving the norms of all vectors in that subspace. 

The subspace embedding problem is to find such an embedding matrix obliviously, that is, to design a 
distribution ir over linear maps S : M. n H > R* such that for any fixed n x d matrix A, if we choose S ~ 7r then 
with large probability, S* is an embedding matrix for A. The goal is to minimize t as a function of n, d, and 
e, while also allowing the matrix-matrix product S ■ A to be computed quickly. 

By taking S to be a Fast Johnson Lindenstrauss transform, one can set t = 0(d/e 2 ) and achieve 
0(nd log t) time for d < rt 1 / 2-7 for any constant 7 > 0. One can also take 5 to be a subsampled ran- 
domized Hadamard transform (see, e.g., [23]) and set t = 0(dlnn(\nd + lnlne? + lnl/e)/e 2 ), to achieve 
0(nd log t) time. These were the fastest known subspace embeddings achieving any value of t not depending 
polynomially on n. Our main result improves this to achieve t = poly(rf/e) for matrices S for which SA can 
be computed in rinz(v4) time! Given our new subspace embedding, we plug it into known methods of solving 
the above linear algebra problems given a subspace embedding as a black box. 

In fact, our subspace embedding is nothing other than the CountSketch matrix in the data stream liter- 
ature [5], see also [30]. This matrix was also studied by Dasgupta, Kumar, and Sarlos [TU]. Formally, S has 
a single randomly chosen non-zero entry S^-ij in each column j, for a random mapping h : [n] M- [t\. With 
probability 1/2, Sh(j),j — 1, and with probability 1/2, S^mj = — 1. 

While such matrices S have been studied before, the surprising fact is that they actually provide subspace 
embeddings. Indeed, the usual way of proving that a random S ~ n is a subspace embedding is to show 
that for any fixed vector y £ R d , Pr[||S , j/|| 2 = (1 ± s) || 2] > 1 — cxp(— d). One then puts a net (see, e.g., 
[1]) on the unit vectors in the column space C(A), and argues by a union bound that 1 1 *S* 2/ 1 1 2 = (1± £)|M|2 
for all net points y. Then, since the net is sufficiently fine, and since the mapping is linear, this implies that 
||5y|| 2 = (1 ± e)|M| 2 for all vectors y € C(A). 

We stress that our choice of matrices S does not preserve of norms of an arbitrary set of exp(d) vectors with 
high probability, and so the above approach cannot work for our choice of matrices S. We instead critically 
use that these exp(d) vectors all come from a d-dimensional subspace (namely, C(A)), and therefore have a 
very special structure. The structural fact we use is that there is a fixed set H of size d/a which depends only 
on the subspace, such that for any unit vector y in the subspace, H contains the indices of all coordinates 
of y larger than yfa in magnitude. The key property here is that the set H is independent of y, or in other 
words, only a small set of coordinates could ever be large as we range over all unit vectors in the subspace. 
The set H is exactly the set of large leverage scores of the column space of A\ 

Given this observation, by setting t > K\H\ 2 for a large enough constant K, we have that with probability 

1 — l/K, there are no two distinct j ^ j' with j,j' £ H for which h(j) = h(f). That is, the coordinates in 
H are "perfectly hashed" with large probability. Call this event £ , which we condition on. 

Given a unit vector y in the subspace, we can write it as y H + y L , where y H consists of y with the 
coordinates in [n] \ H replaced with 0, while y L consists of y with the coordinates in H replaced with 0. We 
seek to bound 

\\Sy\\l = \\Sy H \\ 2 + \\Sy L \\ 2 + 2{Sy H ,Sy L ). 

Since £ occurs, we have the isometry \\Sy H \\\ = \\y H \\2- Now, ||y L || 2 X) < a, and so we can apply Theorem 

2 of [10 which shows that for mappings of our form, if the input vector has small infinity norm, then S 
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preserves the norm of the vector up to an additive 0(e) factor with high probability. Here, it suffices to set 
a = l/poly(d/e). 

Finally, we can bound (Sy H , Sy L ) as follows. Define GC [n] \ H to be the set of coordinates j for which 
h(j) — h(j') for a coordinate f £ H, that is, those coordinates in [n] \H which "collide" with an element of 
H. Then, (Sy H ,Sy L ) = (Sy H , Sy L ), where y L is a vector which agrees with y L on coordinates j £ G, and 
is on the remaining coordinates. By Cauchy-Schwarz, this is at most \\Sy H \\% ■ \\Sy L || 2 . We have already 
argued that || «S'2/ jfr H2 = llj/^lb < 1 for unit vectors y. Moreover, we can again apply Theorem 2 of [TU] to 
bound \\Sy L H2, since, conditioned on the coordinates of y L hashing to the set of items that the coordinates 
of y H hash to, they are otherwise random, and so we again have a mapping of our form (with a smaller t and 
applied to a smaller n) applied to a vector with small infinity-norm. Therefore, \\Sy L ||a < 0(e) + \\y L H2 
with high probability. Finally, by Bernstein bounds, since the coordinates of y L are small and t is sufficiently 
large, \\y L || 2 < e with high probability. Hence, conditioned on event £, \\Sy\\2 = (l±e)||j/||2 with probability 
1 — exp(— d), and we can complete the argument by union-bounding over a sufficiently fine net. 

We note that an inspiration for this work comes from work on estimating norms in a data stream with 
efficient update time by designing separate data structures for the heavy and the light components of a 
vector [351 130] . A key concept here is to characterize the heaviness of coordinates in a vector space in terms 
of its leverage scores. 

Optimizing the additive term: The above approach already illustrates the main idea behind our subspace 
embedding, showing that it can be implemented in nnz(A) time. This is sufficient to achieve our numerical 
linear algebra results in time 0{smz(A)) + po\y(d/e) for regression and 0(nnz(^4)) + n ■ poly(fc/e) for low 
rank approximation. However, for some applications d, fc, or 1/s may also be large, and so it is important to 
achieve a small degree in the additive poly(d/e) and n ■ poly(fc/e) factors. The number of rows of the matrix 
S is t = poly(d/e), and the simplest analysis described above would give roughly t = {d/e) s . We now show 
how to optimize this. 

The first idea for bringing this down is that the analysis of [TU] can itself be tightened by using that we 
are applying it on vectors coming from a subspace instead of on a set of arbitrary vectors. This involves 
observing that in the analysis of [TUJ, if on input vector y and for every i £ [t], Ylj\h(j)=i Vj 1S sma ^ then 
the remainder of the analysis of |10j does not require that ||j/||oo be small. Since our vectors come from a 
subspace, it suffices to show that for every i £ [t], Ylj\h(j)=i \W3W2 ^ s smau > where \\Uj\\\ is the j-th leverage 
score of A. Therefore we do not need to perform this analysis for each y, but can condition on a single event, 
and this effectively allows us to increase a in the outline above, thereby reducing the size of T, and also the 
size of t since we must have t — il(\H\ 2 ). In fact, we instead follow a simpler and slightly tighter analysis of 
[2"U] based on the Hanson- Wright inequality. 

The second idea is that the estimation of ||y ff ||2, the contribution from the "heavy coordinates", is 
inefficient since it requires a perfect hashing of the coordinates. Indeed, while the above optimization is tuned 
for efficiently estimating ||y i || 2 , the number t of rows needed is still roughly (d/e) A due to the estimation 
°f \\y lb- We instead estimate the contribution from the heavy coordinates in a different manner. By 
standard balls-and-bins analyses, if we have 0(d 2 /\ogd) bins and d 2 balls, then with high probability each 
bin will contain O(logd) balls. We thus make t roughly d 2 and think of having 0(d 2 / \ogd) bins. In each 
bin i, 0(logd) heavy coordinates j will satisfy h(j) = i. Then, we apply a separate JL transform on the 
coordinates that hash to each bin i. This JL transform maps a vector z £ M. n to an 0((log d) /e 2 )-dimensional 
vector z' for which ||^'|| 2 = (1 ± £ )IN|2 with probability at least 1 — l/poly((i). Since there are only O(logd) 
heavy coordinates mapping to a given bin, we can put a net on all vectors on such coordinates of size 
only poly(d). We can do this for each of the 0(d 2 / log d) bins and take a union bound. It follows that 
the 2-norm of the vector of coordinates that hash to each bin is preserved, and so the entire vector y H of 
heavy coordinates has its 2-norm preserved. By a result of [29], the JL transform can be implemented in 
0((logd)/e) time, giving total time 0(nnz(v4) logd)/e), and this reduces t to roughly d 2 /e 4 . 

We also note that for applications such as least squares regression, it suffices to set e to be a constant in 
the subspace embedding, since we can use an approach in [21, 9 which, given constant-factor approximations 
to all of the leverage scores, can then achieve a (1 + ^-approximation to least squares regression by slightly 
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over-sampling rows of the adjoined matrix A o b proportional to its leverage scores, and solving the induced 
subproblcm. This results in a better dependence on e. 

We can also compose our subspace embedding with a fast JL transform to further reduce t to the optimal 
value of about d/s 2 . Since S ■ A already has small dimensions, applying a fast JL transform is now efficient. 

Finally we can use a recent result of [7] to replace most dependencies on d in our running times for 
regression with a dependence on the rank r of A, which may be smaller. 

1.3 Experiments 

In <|8j we give some preliminary experimental results for an application to low-rank approximation. Here we 
consider the simplest version of our techniques, where for sparse embedding matrices S and R, we compare 
the error of the low-rank approximation AR T (SAR T )~ SA to A with that of the best rank-fc approximation, 
for different sketching dimensions for S and R, and hence ranks for the low-rank approximation. The results 
are encouraging, and (as often happens) better than the worst-case bounds we are able to prove. We consider 
a broad class of matrices, and find, for example, that when t/k > 4, all but a few such approximations have 
error at most 1.21 times the error of the best rank-fc matrix. 

2 Sparse Embedding Matrices 

Let A £ M nxd . We assume n > d. Let nnz(A) denote the number of non-zero entries of A. We can assume 
rmz(A) > n and that there are no all-zero rows or columns in A. 

For a parameter t, we define a random linear map <£>£> : R" — > R* as follows: 

• h : [n]t-> [t] is a random map so that for each i £ [n], h(i) = t' for t' £ [t] with probability 1/t. 

• $ £ {0, l} tx ™ is a t x n binary matrix with §h{i),i — lj an d all remaining entries 0. 

• D is an nx n random diagonal matrix, with each diagonal entry independently chosen to be +1 or — 1 
with equal probability. 

We will refer to matrices of the form <£>_D as a sparse embedding matrix. 

3 Analysis 

Let U £ W ixr have columns that form an orthonormal basis for the column space C(A). Let U\,*i ■ • ■ , U n ^ 
be the rows of U, and let w; = ||C/i,*|| ■ Note that J2i u i = \\U\\p = r. A unit vector y £ C(A) can be 
expressed as y = Ux for unit vector x, and so yf = (Ui^x) 2 < Ui\\x\\ — Ui- 

Unless otherwise indicated, a vector norm \\y\\ is the £2 norm. 

Let T > be a parameter. 

3.1 Handling vectors with small entries 

We begin the analysis by considering fixed unit vectors y £ C(A) such that for all i £ [n] for which it; > T, 
we have yi — 0. Notice that since y 2 < u,, this in particular implies that WyW 2 ^ < T. We extend this to all 
unit vectors in subsequent sections. 

The following is similar to Lemma 6 of [TU], and is a standard balls-and-bins analysis. 

Lemma 1 For T > 0, 6 £ (0, 1), and t > Mr 2 for M larger than a sufficiently large positive constant, let 
Eh denote the event that 

2 x ^ 

> max > Uj. 

Mr ~ Mt] *-Z t 

Ui<T 
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Proof: Wc will prove that the bound holds for fixed j £ [t] with failure probability S/t, and then apply a 
union bound. 

Let Xi denote the random variable UiIf l (i)=j lUi <T- We will apply Bernstein's inequality to Y]^ Yj where 
Yi = Xi — E[JQ]. We have \Yi\ < T, and with that bound, Bernstein's inequality states that 



logPr 



5>>z 



< 



-z 2 /2 



We have E[Y?] < ELY?] < UiT/t so that £\ E[Y 2 } < rT/t. Hence, |lj becomes 

-z 2 /2 



(1) 



log Pr 



E y *>* 

ie[t] 



< 



rT/t + zT/3' 



(2) 



Using that T < l/(6Mr log(t/<5)) and that t > Mr 2 , we have that rT/t < l/(6M 2 r 2 log(i/<5)). We set 
z = l/(Mr), and consequently have that zT/3 < l/(3M 2 r 2 \og(t/S)). Hence, 



log Pr 



ie[t] 



< 



-l/(2M 2 r 2 ^ 



l/(2M 2 r 2 log(t/<$)) 



log(^). 



Now since Ei E W ^ r A ^ 1 /( M r), 



Pr 



E*< 

»e[t] 



> 



Mr 



< Pr 



»e[t] 



> 



Mr 



< exp(-log(t/£)) = <J/t. 



A union bound completes the proof. 



Lemma 2 J/ t, T, and M satisfy the conditions of Lemma [7J the event Eh holds, and vector y of at most 
unit norm formed by taking a vector y' £ C(A) of at most unit norm and replacing y[ with whenever 
u i > T , then for any £ > 2, let £r,(e) be the event that |||$Z)j/|| 2 — \\y\\ 2 \ < e. Then 

Pr[£ L (e)} >l-e- e Z e , 

where Z — max {VI- \\yh^2/{Mr),i-2/{Mr)}. Here y can depend on but not on D. 

Proof: We will use the following theorem, due to Hanson and Wright. 

Theorem 3 \27l Let z £ M. n be a vector of i.i.d. ±1 random values. For any symmetric B £ W ixn and 
£>2, 

E [\z T Bz - tr(B)\ e ] < (CZf, where Z = max{V£||B|| F , t ■ \\B\\ 2 }, 
and C > is a universal constant. 

We will use Theoremjijto prove a bound on the ^'th moment of H&Dj/Hg for large I. Note that ||$Z)y|| 2 
can be written as z T Bz, where z has entries from the diagonal of D, and B £ M. nxn has Bui = yiyi'Ih(i)=h(i')- 
Here tr(S) = ||y|| 2 . 

Our analysis uses some ideas from the proofs for Lemmas 7 and 8 of [29] . 

Since by assumption event Eh of Lemma [I] occurs, and for a vector y £ C(A) of at most unit norm 
y 2 , < Ui< for all i', we have for j £ [t] that Yli'eh- 1 ^) vf' — 2/ (Mr). Hence 

\\B\\% = E(fiWi') a 4(i')=fcW = E% 2 E Vi> ^ E Vi 2 /( Mr ) < Ml ' 2 /( Mr )- ( 3 ) 

i.i' ie[n] i'e/l _1 (/t(i)) i£[n] 



G 



For ||-B|| 2 , observe that for given j € [t], y^) G R n with j/p'^ = yilh(i)=j is an eigenvector of B with eigenvalue 
Hy^H 2 , and the set of such eigenvectors spans the column space of B. It follows that 

||B|| 2 = m ax|h,«|| 2 = yl<V(Mr). 
Putting this and Q into the Z of Theorem [3j we have, 



Z < max{Vl\\B\\ F ,e- \\B\\ 2 } < max{\^ • \\y\\ 2 ^2/(Mr),£ ■ 2/ (Mr)} 
By a Markov bound applied to \z T Bz — tr(B)\ e , 



Pr[|||$^|| 2 -||y|| 2 |> e ] < <r'Z l 



3.2 Handling vectors with large entries 

Now consider a unit vector y with yi = if Ui < T, for all i e [n] . 

Lemma 4 For a parameter T > 0, let H d [n] denote the subset of rows i for which Ui > T . Let denote 
the event that for all i,j 6 H with i ^ j, h(i) ^ h(j). Then for T > Arjyft, Pr[£g] > 15/16. 



Proof: Since the columns of U are orthonormal, ^ 



ieln] 



r. Therefore \H\ < r/T < 



yji/l. For fixed i ^ j, Pr[h(i) = h(j)} = l/t. Therefore by a union bound, the probability that some i ^ j 
have h(i) = h(j) is at most \H\ 2 /t < 1/16. Thus Pr[S B ] > 15/16. ■ 

Given event £g, we have that for any y of the form above, 

\\vh = \\^Dy\\ 2 . 

More generically, let £h be the event that for such a y, \\&Dy\\ 2 < (1 + e)\\y\\. 
3.3 Handling all vectors 

We have seen that $L> preserves the norms for vectors with small entries (Lemma [2]) and large entries 
(Lemma |4| . Before proving a general bound, we need to prove a bound on the "cross terms" . 
First, a bound on a key contributor to the cross terms. 

Lemma 5 For fixed unit y G R n , let vector y L have yf — y%I u ^<T and vector y H = y — y L . Let H be the 
set of indices of nonzero coordinates of y H , and \H\ < s = r/T. Let G = {i | i G ft, _1 (i'),z' G H}. Let 



y L G R n have yf = yfL ieG . For 5' > 0, suppose T < e 2 / log(l/<5'), and t > Qr\og(l/5')/t i . Then it holds 
with failure probability at most 8' that \\y L || 2 < 2e 2 . 



Proof: To bound \\y L ||, let X, denote the random variable y 2 L i£ Q. We will apply Bernstein's inequality 
to Yi where Yi = Xi — E[JQ]. We have < T, and with that bound, Bernstein's inequality states that 



log Pr 



< 



-z 2 /2 



j E,E K 2 ]+zT/3- 
We have E^ 2 ] < E[X 2 ] < yfTs/t so that E l E K 2 ] < Ts /t = r/t. Hence, (El becomes 



(4) 



log Pr 



< 



-z 2 /2 
r/t + zT/3' 
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Using that T < e 2 /log(l/(S') and that t > 6r log(l/<5')/e 4 , 



logPr 

Now since Y,i E [ x *] <s/t<e 2 , 



<£[*] 



< 



e 4 /log(l/<5') 



= -log(l/<J'). 



Pr 



ie[t] 



< Pr 



E^ 2 



<exp(-log(l/0)=5'. 



Hence 



as claimed. 



Pr[||y i '|| 2 <2e 2 ]>l- ( 5', 



Lemma 6 For fixed unit y € C(A), and using the notation of the previous lemma, suppose the event of 
the previous lemma holds, and also event Sh, and also £i J (e 2 ) holds for y L . Then \(y H ) T DQ T $>Dy L \ < 
v/3e(l + e). 



Proof: From the assumptions of £h and £l(c 2 ) for V 



V 



(y H ) T D<5> T <5>Dy L = {y H ) T D^^Dy L ' < \\^>Dy H \\ \\<t>Dy L ' \\ < (l + e)J\\y 



,L'\\2 



Using the bound on \\y || from the previous lemma, assumed to hold, the lemma follows. I 

Lemma 7 Given any K € (0,1), for fixed y € C{A), assuming the parameter conditions for Lemmas^ [1| 
and^ /ioM, and events £h andSh, then for embedding dimension t — 0((r/e) 4 log 2 (r/e)), with probability 
at least 1 - K r , \\®Dy\\ 2 = (1± e)|jy|| 2 - 



Proof: Write y = y H + y L , where y L has yf = yiI Ui <T- We bound 



\*>Dy\\ 



\<5>Dy 



H\\2 



\<S>Dy 



i II 2 



2y H D<S> T <S>Dy L 



(•5) 



We apply Lemma [2] with £ = r. To satisfy the other conditions of the lemmas, we need to choose T, t, 5', 
and M such that 

1. For given C > 0, (2/(Me 2 )) r / 2 < C r . So M = ft(e~ 2 ) suffices for this; 

2. T < l/(6Mrlog(*/£)) = 0(l/(e 2 r log(t))) for fixed 5; 

3. T>4r/y/t\ 

4. 5' < Cg, for small enough Cs > 0; 

5. T<e 2 /log(t/5') = 0(e 2 /(rlog(t))); 

6. t > 6rlog(l/ ( 5')/e 4 - 0(r 2 /e 4 ); 

There are values T = (3(e 2 /(r log(r/e))) and t = (3((r/e) 4 log 2 (r/e)) that satisfy these conditions. 

With the first two conditions and the event £} ll Lemma [2] implies that Hl^-Dy^l 2 — ||y L || 2 | < e with failure 
probability C~ r . 

The third condition is needed to apply Lemma|4j which implies that £b, and hence £h, hold with constant 
probability. 

The remaining conditions imply by Lemma [5] that \\y L || 2 < 2e 2 with failure probability at most 5'. 
This and the first two conditions (with insignificantly larger M), and £h, imply from Lemma [2] that £i(e 2 ) 
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holds for y L with failure probability at most C r . (Here we can consider the hash function in Lemma [2] to 
be restricted to the buckets containing the indices in H, but the results of the lemma still apply.) With 
Lemma^ this implies that \(y H ) T D<S> T <$>Dy L \ < VSe(l + e). 

That is, with failure probability at most 2C r + C r s < K r , where C and C5 are chosen less than K/3, we 
have |||$Dy||2 - \\y\\ 2 \ < 2e + \/3e(l + e). Adjusting e gives the result. ■ 

Lemma 8 Suppose L is an r- dimensional subspace o/R n , and B : K.™ — » R k is a linear map. If for any 
fixed x 6 L, \\Bx\\\ = (1 ± £/6)||a;||| with probability at least 1 — p, then there is a constant C su b > for 
which with probability at least 1 — p(C su b) r , for all x £ L, ||_Ba;||2 = (1 ± 

Proof: We will need the following standard lemmas for making a net argument. Let S r ~ 1 be the unit 
sphere in R r and let E be the set of points in S r ~ 1 defined by 

E= lw:w£ -^Z r , H| 2 < 1 



where IT is the r-dimensional integer lattice. 

Fact 9 (Lemma 4 of [4]) \E\ < e cr for c = ( ; +2). 

Fact 10 (Lemma 4 of |4j) For any r x r matrix J, if for every u,v £ E we have |u T Jw| < e, then for 
every unit vector w, we have \w T Jw\ < (i_f 7 )2 ■ 

Let U £ ]R™ xr be such that the columns are orthonormal and the column space equals L. Let I r be the r x r 
identity matrix. Define J = U T B T BU—I r . Consider the set E in Fact[9]and Fact 10 Then, for any x,y £ E, 



we have by the statement of the lemma that with probability at least 1 — 3p, \\BUx\\2 = (1 ± e/6)||C/a;|||, 
\\BUy\\l - (l±e/6)||C/y|||, and ||S^(a;+ W )||l - (l± £ /6)||C/(x+y)||l = (l±e/6)(\\Ux\\ 2 2 + \\Uy\\ 2 2 +2(Ux,Uy)). 
Since ||£^||2 < 1 and ||i/y||2 < 1, h follows that \xJy\ < s/2. By Fact^l for 7 = 1 — 1/V2 and sufficiently 
large C su b, we have by a union bound that with probability at least 1 — p C r anih for a constant C su b > 0, that 
\xJy\ < e/2 for every x,y £ E. Hence, with this probability, by Fact |l0[ \w T Jw\ < e for every unit vector 
w, which by definition of J means that for all y £ L, H-BylH = (1 ± £ )I|2/||2- ^ 

The following is our main theorem in this section. 

Theorem 11 There is t — (9((r/e) 4 log 2 (r/e)) such that with probability at least 9/10, $1? is a subspace 
embedding matrix for A; that is, for all y £ C(A), \\<f>Dy\\ 2 — (l±e)||y|| 2 . The embedding <£>_D can be applied 
in 0(nnz(A)) time. 

Proof: We set S = 1/60 in Lemma[TJ so that £h occurs with probability at least 1 — 1/60. Next, the event 
£b of Lemma |4] occurs with probability at least 15/16. By a union bound, both events £h and Eh occur 
with probability at least 1 — 1/60 — 1/16. Conditioned on this, by Lemma[7j for any fixed y £ C(A), with 
probability at least 1 — K r , ||$Dy||2 = (1 ± e)||i/||2, where K > can be chosen arbitrarily small. Hence, by 
Lemma[8j and choosing K < l/60C s „b, with probability at least 1 — 60 _r , ||$Dy||2 = (1 ± 6e)||j/||2 for all 
y £ C(A). Hence, by a union bound, with probability at least 9/10, ||$Dj/||2 = (1 ± 6£)||y|| 2 for all y £ L. 
The theorem follows by rescaling e by a constant factor. I 



4 Generalized Sparse Embedding Matrices 
4.1 Johnson-Lindenstrauss transforms 

We start with a theorem of Kane and Nelson 29J , restated here in our notation. We also present a simple 
corollary that we need concerning very low dimensional subspaces. Let e > 0, k — 6(e~ 2 log(r/e)), and 
v = Q(e~ 1 ) be such that v divides k. Let B : R n — > K fc be defined as follows. We view B as the concatenation 
(meaning, we stack the rows on top of each other) of matrices \fv/k ■ $1 • Di, . . . , \fvjk ■ &k/v ' D k / V , each 
<&i • Di being a linear map from K." to M. v , which is an independently chosen sparse embedding matrix of 
Section [3] with associated hash function hi : [n] — > [v] . 
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Theorem 12 (fMSj) For any e > and constant Ckn > 0, there are k — 9(e -2 log(r/e)) and v = 9(e _1 ) 
withy | k for which for any fixed x G R", a randomly chosen B of the form above satisfies \\Bx\W = (l±e)||x||2 
with probability at least 1 — (s/r) KN . 

Corollary 13 Suppose L is an 0(log(r /e))- dimensional subspace o/R™. Let C su i,kn > be any constant. 



Then for any < e < 1, there are k — 0(e 2 log(r/e)) and v 
at least 1 - ( £ /r) c ^ bKN , for all y £ L, \\By\\% = (1 ± e)||j/|||. 



0(e ) with v I k for which with probability 



Proof: We use Theorem [l2| together with Lemma [§J for the latter, we need that for any fixed y £ L, 



\ByW2 = (1 ± e /6)|Ml2 with probability at least 1 — p. By Theorem 12 we have this for p = (ej 



for an arbitrarily large constant Ckn > 0. Hence, by Lemma [51 there is a constant C su b > so that with 
probability at least 1 - (C sub )°^ r l^\e/r) CKN = 1 - (e/r)^ bKN , for all y £ L, \\By\\% = (1 ± e)||y|||. 
Here we use that Ckn > can be made arbitrarily large, independent of C su b- H 



4.2 The construction 



and Corollary 13 apply 



We now define a generalized sparse embedding matrix S. Let A £ M™ with rank r 
Let k = Q(e~ 2 log(r/e)) and v = 0(e _1 ), with v | k, be such that Theorem 12 
with parameters k and v, for a sufficiently large constant C su bKN > 0. Further, let 

t = C t r£- 4 log(r/£)(r + log(l/e)), 

where Ct > is a sufficiently large absolute constant, and suppose that fc | t. Let 5 = t/k. 

Let /i : [n] — !• [g] be a random hash function. For i — l,2,...,q, define a, = Note that 

ELi <»i 



with each B® as in Theorem 
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with parameters k and 



We choose independent matrices B^\ . 
v. Here flW is a fc x a, matrix. Finally, let P be an n x n permutation matrix which, when applied to a 
matrix A, maps the rows of A in the set h~ 1 {l) to the set of rows {1,2,..., ai}, maps the rows of A in the 
set ft -1 (2) to the set of rows {ai + 1, . . . , ai + 02}, and for a general i € [<?], maps the set of rows of A in the 
set h~ l (i) to the set of rows {a\ + 02 + • • • + <ii_i + 1, . . . , a\ + a.2 + • • • + ai}- 

The map S is defined to be the product of a block-diagonal matrix and the matrix P: 



S = 



B (2) 



B (9) 



• p 



Lemma 14 5 • A can be computed in 0(nnz(A)(log(r/e))/e) time. 

Proof: As P is a permutation matrix, P ■ A can be computed in 0(nnz(^4)) time and has the same number 
of non-zero entries of A. For each non-zero entry of P ■ A, we multiply it by for some i, which takes 
0(k/v) = 0(log(r/e)/e) time. Hence, the total time to compute S ■ A is 0(nnz(^4)(log(r/e))/e). I 



4.3 Analysis 

We adapt the analysis given for sparse embedding matrices to generalized sparse embedding matrices. Again 
let U £ M. nxr have columns that form an orthonormal basis for the column space C(A). Let Ux t *, . . . , U n> * 



be the rows of U , and let Ui 



|£/j,*|| . We set the parameter: 
T ■ 



12C T log(10fc/u)(r + log(l/£))' 
where Ct is a sufficiently large absolute constant. 
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4.3.1 Vectors with small entries 



Consider a fixed vector y of at most unit norm formed by taking a vector y' € C (A) of at most unit norm, and 
replacing y[ with whenever Ui > T. Since y 2 < u,, this implies that llylj 2 ^ < T. Since P is a permutation 
matrix, we have H-Pyll^ < T. 

In this case, we can reduce the analysis to that of a sparse embedding matrix. Indeed, observe that 
the matrix B^' is the concatenation of matrices $\ D\ , . . . , <^ k j v D k j v , where each $J ' Dj is a sparse 
embedding matrix. Now fix a value j € [k/v] and consider the block-diagonal matrix Nj'. 



$(2) D (2) 
J 3 



$(«)/)(«) 



P 



Lemma 15 Nj is a random sparse embedding matrix with qv — tv/k rows and n columns. 

Proof: Nj has a single non-zero entry in each column, and the value of this non-zero entry is random in 
{+1,-1}. Hence, it remains to show that the distribution of locations of the non-zero entries of Nj is the 
same as that in a sparse embedding matrix. This follows from the distribution of the values ax, . . . , a q . I 

For j = 1, . . . ,k/v, let Et be the event Eh of Lemma [l] applied to matrix Nj 

We set S in Lemma 1 to be v/(10k) to conclude that by a union bound, C^^E^ occurs with probability 
at least 9/10. 

We apply Lemma[2]on Njy for each j = 1, . . . , k/v. We will set the parameter there to be r + log 1/e. 
The parameter "M" in Lemma^is set to be 2Cr(l/e 2 + log(l/e)/(re 2 )), and the parameter "f" in Lemma 
[2]is equal to qv — Sl(r 2 e~ 3 +re log(l/e)), which in turn is at least Mr 2 = <d(r 2 /e 2 + r log(l/e)/e 2 ) by our 
choice of q and v. Notice that the number of rows of Nj is indeed qv by Lemma 15 Also, the parameter 
"(5" in Lemma [T] equals v/(10k). We set the parameter "T" in Lemma[2]to be 



T = 



1 



6Mr log 1/5 Y1C T log(10Jfc/u)(r + log(l/e)) ' 

which agrees with our aforementioned value of T, and satisfies the requirement of Lemma [2] (and also of 
Lemma [lj. Therefore, we can indeed apply Lemma [2] 
By Lemma |2j we have for each j 6 [k/v]: 



Pr[|||^2/|^-||y||^>£]< 



(—)' 

\Mre 2 J 



2(r + logl/e) 



(r+logl/e)/2 



3 hu-2 nun-ji '--i -= \ m„ 2 I \ Mr£ 2 J y Me 2 

Plugging in our choice of M, for Ct > a sufficiently large constant, this is 



( 2 , 2 log 1/e 

Mrs 2 



(r+log l/e)/2 



Pr[|||iViy||i-||j/||il>£]< 
and so by a union bound, 

Prpj, \\\N. ] y\\l~\\v\\l\>e]<{k/v)-0{s 2 ) 
Since 



2 

Ct 



(r+log l/e)/2 



2 

Ct 



r/2 



= 0(e 2 ) 



0{e- 1 \og{r/e))-0{e 



2 

Ct 



r/2 



Ct 

IT 



-r/2 



0(1)- 



C7 1 



-r/2 



\Sy\\t 



k/v 
0=1 



it follows that 



Pr[|||^||^-||2/||^l>£]=0(l).(C T /3)- r / 2 . 
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4.3.2 Vectors with large entries 

Consider the set H of i for which it, > T. Then by our choice of T, 

. TT . r 12C T r\og(10k/v)(r + log(l/e)) _ 2n M . . . . w , ,., . ... 

F < - = ^ 4^ = 0(re log 1/e + loglogr) r + log (1/e . 

1 e z 

The following is a standard non-weighted balls-and-bins analysis. 

Lemma 16 For a large enough constant Ct > 0, with probability at least 1 — 1/r, for alii £ [q], \h~ 1 (i)PiH\ < 
C t log(r/e). 

Proof: For any given i e [q], E[|/i^ 1 (i) n H\] = \H\/q = \H\k/t. Using that 

t/k = n(re- 2 (r + log(l/e))), 



we have that this expectation is 0(log(l/e) + loglogr) = 0(log(r/e). Hence, by a Chernoff bound, for a 
constant C\ > 0, 

-e(log(r/ e )) = JL ; 

rq ' 



Prll/i" 1 ^) n H\ > C t log(r/e)] < e 



The lemma now follows by a union bound over all i G [<?] . 



Let f nTO be the event of Lemma 16 and assume £ nw occurs, which happens with probability at least 1 — 1/r. 

For i = 1, 2, . . . , q, let be the at most Ct log(r/£)-dimensional subspace which is the restriction of the 
column space C(A) to coordinates j with h~ l (j) = i and Uj > T. By Corollary 13 for any fixed i, with 
probability at least 1 — ^£/ r ) c ^ bKN , for all y € L l , HS't/Hl = (1± ^llj/Hl- By a union bound and sufficiently 
large C su bKN > 0, this holds for all i € [q] with probability at least 1 — q{e/r) CsuhKN > 1 — 1/r. Let £ s be 
the event that this occurs for all i. 

Then, conditioned on £ nw and £ s , which jointly occur with probability at least 1 — 2/r, it follows that 
ll^ylli = (1 i e )l|y|l2 f° r a H vectors y obtained by taking a vector y' g C(A) and replacing the coordinates 
7/^- with for those j for which Uj < T . 

4.4 Putting it all together 

Now consider any unit vector y in C (A) , and write it as y H + y L , where yf 1 = for coordinates i for which 
Ui < T, while yf — for coordinates i for which itj > T. We seek to bound (Sy H , Sy L ) . For notational 
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convenience, define the block-diagonal matrix Nj to be the matrix 




3 3 






$(2) D m 
] o ] 



3 3 





p 



Then S — \Jvjk ■ Y^J=\ Nj- Notice that since the set of non-zero rows of Nj and Nj> are disjoint for j ^= f, 



(Sy H ,Sy L ) = l^iNjy^Njy^) = ^(N^ , Nj y L ) , 



(6) 



3 = 1 



3 = 1 



where by Lemma 15 each Nj is a sparse embedding matrix with qv = t ■ v/k rows and n columns. 

We will apply Lemma [6] to bound each summand (Njy H , Njy L ) . We condition on events £ nw ,£ s , and 
Phl^Sl jointly occurring, which by a union bound happens with probability at least 9/10 — 2 jr. 

- r/T. We apply Lemma [6] with our values of s,t,T, and M specified above, together with 
C^ r ■ To apply Lemma [H] and Lemma |6j we need 



?3 

Also, let s 



the value 6' 

1. t,T, and M to satisfy Lemma [T] 

2 2 

Z ' 1 - log(t/<5') C T r+log(t/s) ' 

3 t > 6rlog(t/<5') _ 6r(C T r+log(t/e)) 



We have already shown that the first condition holds. Setting t = Ctr 2 e 4 \og(r/e) + Ctre 4 log(r/e) log(l/e) 
for a large enough constant Ct, we have log(t/e) < 61ogr + 61og(l/e) for r and 1/e larger than a large enough 

2 

constant. Plugging this choice of t into the above, we see that our choices of t and T = 12Ct iog(iofc/u)(r+iog(i/e)) 
satisfy the remaining two conditions for r and 1/e larger than a large enough constant. 

Notice that £ nw and £ s occurring implies that the event £r of Lemma 6 occurs. Also, nfjySl occurring 
implies that the event £h of Lemma [T] occurs. Therefore, applying Lemmasjo] and |6j given the occurrence of 
these events we have that with probability at least 1 — 6', 



\{N 3V H ,N 3 y L )\<^ie(l + e). 
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By a union bound over j £ [k/v], we have that this occurs with probability at least 1 — (k/v)6' for all j £ [k/v\. 
Therefore, with probability at least 1 - (k/v)5' = 1 - 0(1) • C^ r , it holds that \(Sy H , Sy L )\ = 0(e). In this 
case, with probability at least 1 - 0(1) • (C T /'Sy r/2 - 0(1) • C^ r , 

\\Syg = \\Sy H \\l + \\Sy L \\ 2 2 + 2(Sy H ,Sy L ) 

= (l±0(e))||y ff || 2 + (l±0( e ))|| y L ||| 2 ±0(e) 
= (l±0( e ))||y||l, 

where the last equality uses that y is a unit vector. 
The following is our main theorem in this section. 

Theorem 17 With probability at least 9/10, for t = 0(re~ 4 log(r/e)(r + log(l/e))), S is an embedding 
matrix for A; that is, for all y £ C(A), \\Sy\\2 = (1 ± £)||2/||2- S can be applied to A in 0(nnz(A)(logr)/e) 
time. 

Proof: Events £ nw ,£ s and n*£"£j[ jointly occur with probability at least 9/10 — 2 jr. Conditioned on 
their joint occurrence, the above analysis shows that for any fixed y £ C(A), ||Sy||2 = (1 ± O(er)) || y || 2 with 
probability at least 1 - 0(1) ■ (C 2 /3)~ r / 2 , where C 2 > is an arbitrarily large constant. Applying Lemma 
§ with probability at least 1 - O(l) ■ (C 2 /3)^/ 2 • C r sub > 1 - 1/r, for all y £ C(A), \\Sy\\ 2 = (1± 0(e))\\y\\ 2 . 
Hence, by a union bound, with probability at least 4/5, for all y £ C(A), \\Sy\\2 = (1 ± O(e) ) || y || 2 ■ The 
theorem follows by rescaling e by a constant factor. I 



5 Approximating the Leverage Scores 

Let A £ R nxd with rank r. Let U £ R nxr be an orthonormal basis for C(A). In [TH] it was shown how 
to obtain a (1 ± ^-approximation u[ to m for all i £ [n], for a constant e > 0, in time 0(nd\ogn) + 
0(d 3 log n log d). Here we improve the running time of this task as follows. We state the running time for 
constant e, though for general e the running time would be 0(nnz(A) logn) + poly (re" 1 logn). 

Theorem 18 For any constant e > 0, there is an algorithm which with probability at least 2/3, outputs a 
vector (u'i, . . . , u' n ) so that for all i £ [n], u\ = (1 ± £)itj. The running time is 

0(nnz(A) logn + r 3 log 2 r + r 2 logn). 

The success probability can be amplified by independent repetition and taking the coordinate-wise median of 
the vectors v! across the repetitions. 

Proof: We first run the algorithm of Theorem 2.6 and Theorem 2.7 of [7]. The first theorem gives an 
algorithm which outputs the rank r of A, while the second theorem gives an algorithm which also outputs 
the indices i\, . . . ,i r of linearly independent columns of A. The algorithm takes 0(nnz(A) logd) + 0(r 3 ) 
time and succeeds with probability at least 1 — 0(logd)/d 1 ^ 3 . Hence, in what follows, we can assume that 
A has full rank. 

We follow the same procedure as Algorithm 1 in |18j . using our improved subspace embedding. The 
proof of [18j proceeds by choosing a subspace embedding Hi, computing Hi A, then computing a change of 
basis matrix R so that HiAR has orthonormal columns. The analysis there then shows that the row norms 
IK-Ai?)^*!!?, are equal to u^lzks). To obtain these row norms quickly, an r x O(logn) Johnson-Lindenstrauss 
matrix H 2 is sampled, and one first computes RH2, followed by A(RXl2). Using a fast Johnson-Lindenstrauss 
transform Hi, one can compute Hi^4 in O(nrlogn) time. Hi has 0(r logn log r) rows, and one can compute 
the r x r matrix R in 0(r 3 logn log r) time by computing a QR- factorization. Computing RH 2 can be done 
in O(r 2 kogn) time, and computing A(RH 2 ) can be done in 0(nnz(A) log n) time. 

Our only change to this procedure is to use a different matrix n 1; which is the composition of our 



subspace embedding matrix S of Theorem 17 with parameter t = 0(r log r), together with a fast Johnson 



Lindenstrauss transform F. That is, we set Hi = F ■ S. Here, F is an 0(r log 2 r) x t matrix, see Section 
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2.3 of [IS] for an instantiation of F. Then, S ■ A can be computed in 0(nnz(A) log r) time by Lemma 14 
Moreover, F • (SA) can be computed in 0(t ■ r logr) = 0(r 3 log 2 r) time. One can then compute the matrix 
i? above in 0(r 3 log 2 r) time by computing a QR-factorization of FSA. Then one can compute -R1T2 in 
0(r 2 log n) time, and computing A(RH 2 ) can be done in 0(imz(A) log n) time. Hence, the total time is 
0(nnz(A) log n + r 3 log 2 r + r 2 log n) time. 



Notice that by Theorem 17 with probability at least 4/5, ||5y||2 = (1 ± e)||2/||2 for all y € C(A), 
and by Lemma 3 of Q2], with probability at least 9/10, \\FSy\\ 2 = (1 ± fOII^I^ for all y G C(A). Hence, 
||F5Ax|| 2 = (l±e) 2 || Ar|| 2 for all x £R d with probability at least 7/10. There is also a small l/n probability 
of failure that || (Ai?n 2 )i,* H2 7^ (1 ± e )||(A~R)j,*||2 for some value of i. Hence, the overall success probability 
is at least 2/3. 

The rest of the correctness proof is identical to the analysis in [T5] . I 

6 Least Squares Regression 

Let A £ R nxd and b £ K™ be a matrix and vector for the regression problem: min^ \\Ax — b\\ 2 . We assume 
n > d. Again, let r be the rank of A. We show that with probability at least 2/3, we can find an x' for which 

\\Ax' - 6IL < (1+e) min \\Ax — b\\ . 

X 

We will give two different algorithms, one using Theorem |11[ and the other using Theorem |17| 

Theorem 19 The l 2 -regression •problem can be solved up to a (1 + e) -factor with probability at least 2/3 in 
0(nnz(A) + d b e~ 4 log 3 (d/e)) time. 

Proof: By Theorem [Tl] applied to the column space of A adjoined with the vector b, it suffices to compute 
&DA and &Db and output a.rgmm x \\&DAx — $Z?6|| 2 . We use the fact that d > r, and apply Theorem 
with i = 0(d 4 £- 4 log 2 (d/e)). 

The theorem implies that with probability at least 9/10, all vectors y in the space spanned by the columns 
of A and b have their norms preserved up to a (1 + e)-factor. Notice that &DA and QDb can be computed 
in 0(nnz(j4)) time. Now we have a regression problem with d' = 0(d 4 e~ 4 \og 2 (d/e)) rows and d columns. 
Using the Fast Johnson-Lindenstrauss transform, this can be solved in 0(d'd\og(d/e) + d 3 e~ 1 logd) time, 
see, Theorem 12 of [3^. The success probability is at least 9/10. This is 0(d 5 e~ 4 log 3 (c?/e)) time. I 

Theorem 20 The l^-reyression problem can be solved up to a (1 + e) -factor with probability at least 2/3 in 
0(nnz(A)(logd + log(l/e)) + r 3 e" 2 (log(r/e))(logl/e)) time. 

Proof: We first run the algorithm of Theorem 2.6 and Theorem 2.7 of [7\. The first theorem gives an 
algorithm which outputs the rank r of A, while the second theorem gives an algorithm which also outputs 
the indices ii, . . . ,i r of linearly independent columns of A. The algorithm takes 0(nnz(A) logd) + 0(r 3 ) 
time and succeeds with probability at least 1 — Oilogd) / d 1 / 3 . Hence, in what follows, we can assume that 
A has full rank. 

We then apply the subspace embedding of Theorem 11 to A 06, which takes 0(imz(A) +r 4 e~ 4 log 2 (r/e)) 
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time and produces a matrix A' = &DA and vector b' = where A' is r' x r and b' is r 1 x 1 r' = 

0{r 4 e~ 4 log 2 (r/e)). Notice that nnz(A') < nnz(A) by definition of Observe that by Theorem |ll[ with 
probability at least 9/10, for all x £ W, \\A'x - b'\\ 2 = (1 ± e)\\Ax - b\\ 2 . 
Next, we need the following theorem of Dasgupta et al. 

Fact 21 (Theorem 5 of J2J/, restated in our notation) Let A be an n x d matrix of rank r. Let U be an 
orthonormal basis for C(A), and let us randomly sample rows of A according to any distribution (p±, . . . ,p n ) 
for which for all i, pi > min(l,(7 • Ui/r), where q > 32 2 r(r ln(12/e) + ln(2/(5))/(4e 2 ). Then with probability 
1 — S, the following holds for all x £ M. d : 

\\\SAx\\ 2 ~\\Ax\\ 2 \<e\\Ax\\ 2 , 

where S is a diagonal matrix where Sij = 1/ \fpl if we sample row i, and Si^ = otherwise. 
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We run the algorithm of Theorem 18 on matrix A' to obtain a vector v! with u[ i > hiii for all i € 



We 



set ^ = min(l, 2g • u^/r), where q = B(r 
a small constant. Then, the number of i 
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with the 5 there set to equal 



: £~ 2 ln(l/e)) is chosen as per Fact 

for which 2q ■ u'Jr > 1 is at most 2q, since ^ Uj = r. In addition 
to these rows being sampled, the expected number of rows sampled is at most 



-o E 



= 0(g), 



and so by a Markov bound the number of samples is O(g) with arbitrarily large constant probability. The 
time complexity is 0(nnz(A') log(r/e) + r 3 log 2 r + r 2 log(r/e)). 

Now we now have a regression problem on an 0(q) x r matrix SA. Using the Fast Johnson-Lindenstrauss 
transform, this can be solved in 0(qrlogq + r 3 e~ 1 logr) time, see [3j|l IM]- The success probability is at least 
9/10. This is 0(r 3 £- 2 log(r/e) log(l/e)) time. 

Hence, the overall time complexity is as claimed, and the success probability is at least 2/3. I 



7 Low Rank Approximation 

While theorems Theorem [TT] and Theorem [17| are stated in terms of specific constant probability of success, 
they can be re-stated and proven so that the failure probabilities are arbitrarily small, but still constant. 
In the following we'll assume that adjustments have been done, so that the sum of a fixed number of such 
failure probabilities is at most 1/5. 

7.1 Preliminaries 

We collect a few standard lemmas and facts in this subsection which we need for low rank approximation. 

Lemma 22 (Approximate Matrix Multiplication^ For A and B matrices with n rows, and given e > 0, 
there is t = 0(e~ 2 ), so that for a t x n generalized sparse embedding matrix S, or t x n fast JL matrix, or 
subsampled randomized Hadamard matrix, 

Pv[\\A T S T SB - A T B\\ 2 F < e 2 ||A||^||i?|| 2 F ] > 6, 

for any fixed S > 0. 

Proof: For such a matrix with parameters k and v, first suppose v = 1, so that S is the embedding matrix of 
^2j Let X = A T S T SB—AB. Then X itj = Aj S T SB] -Aj B v where A t is the i-th column of A and B j is the 

j'-th column of B. Thorup and Zhang [40] have shown that E[X M ] = and Var[X itj ] = 0{l/t)-\\Ai\\ 2 2 , \\B \\ 2 . 

Consequently, E[X?.] = Var[Xj j] = 0(l/t) ■ II^II^H^I^; horn which for an appropriate t — 8(e~ 2 ), the 

lemma follows by Chebyshev's inequality. For v > 1, X^j — | J2ie[t/v] ^i,v see ©' so t na t' 

VarfXij] = ^Var[X tJ ] < JlMMIaM 2 < ill^llall^ll^ 

i 

and similarly the lemma follows for the sparse embedding matrices. The result for fast JL matrices was 
shown by Sarl6s[3J|, and for subsampled Hadamard by Drineas et al. |24] : also the claim follows from norm- 
preserving properties of these transforms, see [28]. I 

Fact 23 Given n x d matrix A of rank k < n 1 / 2-7 for 7 > 0, and e > 0, an m x n fast JL matrix H with 
m = 0(fc/e 2 ) is a subspace embedding for A with failure probability at most 8, for any fixed 8 > 0, and 
requires 0(ndlogn) time to apply to A. 

A similar fact holds for subsampled Hadamard transforms. 
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Fact 24 (Normal Equations) Given n x d matrix C , and n x d' matrix D consider the problem 

min \\CX-D\\ 2 F . 

The solution to this problem is X* = C'~ D, where C~ is the Moore-P enrose inverse of C . Moreover, 
C T (CX* — D) = 0, and so if c is any vector in the column space of C, then c T (CX* — D) = 0. 

Fact 25 (Pythagorean Theorem) If C and D matrices with the same number of rows and columns, then 
C T D = implies \\C + D\\ 2 F = \\C\\ F + \\D\\ 2 F . 

7.2 An Intermediate Theorem 

The main theorem in this subsection is the following. It is very close to Lemma 1 of |24) . 

Theorem 26 Suppose A and B are matrices with n row s, and A has rank at most k. Suppose S is a t x n 



matrix, and the event occurs that S satisfies Lemma 22 with error parameter y/e/k, and also that S is a 



subspace embedding for A with error parameter eo < l/v2. Then if X is the solution to 

mhi\\S(AX - B)f F , (7) 



and X* is the solution to 
then 



v±i\\AX - B\\%, (8) 



\AX-B\\ F < (l + e)\\AX*-B\\ F . 



Before proving Theorem 26 we will need the following lemma. 
Lemma 27 For S, A, B, X* and X as in Theorem 
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\\A(X-X*)\\ F < 2V^\\B-AX*\\ F . 

Proof: Let A = UY<V T denote the singular value decomposition of A. Since A has rank at most k, we 
can consider U and V to have at most k columns. Observe that 

\\A(x - x*)\\ F = \\uj:v t (x -x*)\\ f = \\u t uzv t (x - x*)\\ F = \\p\\ F , 



where f3 := U T A(X - X*). By Fact[24j we have A T S T S{AX - B) = 0. Since V T Y, has full column rank, 
the normal equations imply 

U T S T S(AX - B) = 0. (9) 

To bound ||/3|) F , we bound ||?7 T S' T S'C//3|| F , and then show that this implies that \\P\\ F is small. Using that 
U T U = I and we have 

U T S T SUf3 = U T S T SUU T A{X - X*) = U T S T SA(X - X*) + U T S T S(B - AX) = U T S T S(B - AX*). 
Using the hypothesis of the theorem, 

T> A V*MI s . /TTuUttW II U /IV* II S . /7II U /IV*I| 

To show that this bound implies that \\P\\ F is small, we use the property of any conforming matrices C and 
D, that \\CD\\ F < \\C\\ 2 \\D\\ F , obtaining 

\\(3\\ F < \\U T S T SUp\\ F + \\U T S T SUP -0\\ F < y/e\\B - AX*\\ F + \\U T S T SU - I\\ 2 \\0\\ F . 

By the hypothesis of the theorem, ||S'C/a;|| 2 = (1 ± eo)||a;|| 2 for all x, so that U T S T SU — I has eigenvalues 
bounded in magnitude by £q, which implies singular values with the same bound, so that \\U T S T SU — I\\ 2 < 
eg. Thus \\(3\\ F < y/i\\B - AX*\\ F + e 2 \\f3\\ F , or 



\U T S T SUP\\ F = \\U T S T S(B- AX*)\\ F < yfi/k\\U\\ F \\B - AX*\\ F < Ve\\B - AX* 



\\P\\ F <V~4B- AX*\\ F /(l - e 2 ) < 2^e\\B - AX 
since eg < 1/2. This bounds ||/3|| F , and so proves the lemma. 



F ' 
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Proof of Theorem 

columnspace, 



26 



Let UT,V T denote the SVD of A. By Fact 24 and since U and A have the same 



U T (AX* - B) = A T (AX* - B) = 0. 
This and Fact [25l the Pythagorean Theorem, imply 



\AX-B\\ 



\AX* 



B\ 



\A(X -X* 



2 

If 



(10) 

(11) 



which with Lemma [27} implies that 



\AX-B\\ F < (l + 2e)\\AX* 



B\ 



F ' 



using y/1 + 4e < 1 + 2e. Adjusting and renaming e, the theorem follows. I 

We use the following observations to analyze the composition of our sparse embedding matrices with fast 
JL matrices. 

Fact 28 If S £ M. txn approximates matrix products and is a subspace embedding with error e and failure 
probability 5s, and II £ ]R* X * approximates matrix products with error e and failure probability Sn, then US 
approximates matrix products with error 0(e) and failure probability at most 5s + <5n- 



Proof: This follows from two applications of Lemma 22 together with the observation that ||5Aa;|| = 
(1 ± e)||Aa;|| for basis vectors x implies that ||5A|| F = (1 ± e)||yl|| F . I 



Fact 29 If S £ 



is a subspace embedding with error e and failure probability 5s, and H £ 



is a 



subspace embedding with error e and failure probability 5u, then IIS is a subspace embedding with error O(e) 
and failure probability at most 5s + ^n- 



7.3 Putting it All Together 

Let Aft = || A — .Afc|| F , where Ak is the best rank-fc approximation to A. The following theorem is the same 
as Theorem 4.7 of [S] except that we use our sparse embedding matrices instead of the dense sign matrices 
used in [8]. 

Theorem 30 Let A be an n x n matrix, and k £ [n] with k < ti 1 / 2-7 with 7 > 0. Let R and S be 
independent t v x n and t' v x n generalized sparse embedding matrices, respectively, for parameters v = 1 with 
t = t 1 = Q(k/e + k 4 \og 2 k) and H x = 6((fc/e) 4 log 2 (fc/e)) or v JL = 8(logr) with t V]L = Q(k/e + k 2 ) and 
t' VsL = e(fce" 4 log(fc/e)(fc + log(l/e))). Let n 5 be a fast JL operator from R< to R°( k / e2 \ and let Il R be a 
fast JL operator from E*» to R 0( - k / e \ Let S = U S S, and R = U R R. Let 

A = AR T {SAR T )-SA. 

Then 

Pv[\\A -A\\ F <(l + e)\\A-A k \\ F ]>l- 5, 

for any fixed 5 > 0. Moreover, AR T andZA and (SAR T )~ can be computed in 0(v 2 imz(A))+(n+t v )t' v log n) 
time. Alternatively, a similar decomposition can be computed, without fast JL matrices, in 0(v 2 nnz(^4) + 
poly(fe/e)) time, with factors of dimension n x m, m x ml , and m! x n, where m,m' = (9(poly(fc/e)). 
For k > n 1 ! 2 ^ 1 , the subsampled randomized Hadamard transforms can replace the fast JL transforms, with 
log 2 (fc/e) increase in size for the transforms. 



Proof: For the given parameters, the event of Theorem 26 occurs for S and a rank Q(k/e) matrix with 
arbitrarily small constant failure probability, since t' v — il(k/e 2 ) suffices for Lemma 22 to apply, and t[ 



suffices for Theorem 11 to apply, while t' v suffices for Theorem 17 to apply, and similarly for II5, and for 
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S using Facts 28 and [29j with a total failure probability that is an arbitrarily small constant. We apply 
Theorem 26 to S, with k, A, and B of the theorem mapping to k/e, AR T , and A, respectively. 



The result is that with small fixed failure probability, for X the solution to 

mm\\SAR T X - SA\\ F , 



x 

we have 



\AR X — A\\ F < (l + e)||AR'X* - A\\ F = (1 + e) min ||AR 1 X - A\\ F . 



Similarly the parameters for R suffice for Theorem 26 to apply with k, A, B, and t of the theorem mapping 



to k, Aj , A T , and t v , respectively, and similarly for Hr and R. 
We have, with small fixed falure probability, 

\\AR T X* -A\\ F <(1 + e)\\A - A k \\ F = (1 + e)A k , (12) 
because Y = AR T (A k R T )~ has, by Theorem 



26 



\\YA k - A\\ < (1 + e) min \\YA k - A\\ = (1 + e)A k , 

and so 

min \\AR T X - A\\ < \\AR T (A k R T )~ A k - A\\ < (1 + e)A fc . 

Since X = (H S SAR T )- SA, we have 

\\AR T (SAR T )-SA -A\\ F = \\AR T X-A\\ F < (1 + e)\\AR T X* - A\\ F < (1 + e) 2 A k , 

and the first part of the theorem follows, after adjusting e by a constant factor. 

For the time complexity, because R and S are sparse embedding matrices, we can compute SA and AR T 
in 0(v imz(A))) time, and because Hr and II5 are fast JL matrices, we obtain SA and AR T in 0{nt' v logn) 
time. From this, we can compute SAR T in 0(v 2 nnz(A) + t v t' v logn) time, and (SAR T )~ in 0(fc 3 /e 4 ) time. 

■ 

Theorem 31 (Main) Given an n x n matrix A, with probability at least 3/5, we can compute the following 
triple of matrices [L, D, W), where L is an n X k matrix, D is a k x k diagonal matrix, and W is a k x n 
matrix, with the property that 

\\A-L-D-W\\ F <(l + s)A k , 

that is, L ■ D ■ W is within (1 + e) of the best rank-k approximation. Furthermore, L, D, and W can be 
computed in 0(v 2 nnz(A)) + (n + t' v (e 2 ))t' v (e 2 )) time, where t[(e 2 ) = 9((fc/e 2 ) 4 log 2 (fc/e)) ; and t' V]L (e 2 ) = 
6(A;e- 8 log(A;/e)(fc + log(l/e))). 



Proof: By Theorem 30 we can compute AR T , (SAR T ) , and SA in the given time, for which with 
constant probability, 

\\A-A\\ F <(l + e)A k , 

where A = AR T (S AR J S A. 

By Theorem 4.8 of [8], under these conditions it follows that if U is an orthonormal basis for the column 
space of AR T , then 

\\A-U[U T A] k \\ F <(l + ^)A k , 

where [J7 T A]fc denotes the best rank-fc approximation to U T A. We replace e by e 2 adjust the t v = t v (e 2 ) 
values accordingly, and have 

\\A-U[U T A} k \\ F < (l + e)A k . 

Notice that U[U T A] k is a matrix of rank k. Given AR T , we can compute U in time 0(nk 2 /e 4 ), since 
AR T is n x 0(k/e 2 ). We can also compute U T A in 0(nk 2 /e 6 ) time by first computing U AR T , then 
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U T AR T (SAR T )~ , and then U T AR T (SAR T )~ ZA. We can compute the best rank-fc approximation [Z7 T A] fc 
to U T A in time 0{nk 2 /e A ) since U T A is an 0(k/e 2 ) x n matrix. Further, we can compute the SVD UT,V T 
of [U T A] k in 0{nk 2 /^) time. 

Finally, we set L = UU, D = S, and W = V T . It follows that L, D, and W can be computed in 
0(v 2 nnz(a)) + (n + t' v (e 2 )t' v (e 2 ) + nk 2 /e 4 ) time, where the last summand can be omitted since the previous 
one dominates. By construction of L, D, and W, the requirements of the theorem are satisfied. This 
completes the proof. I 



8 Preliminary Experiments 



Some preliminary experiments show that the low-rank approximation technique of Theorem 30 is quite 
promising, and in practice may perform much better than the general bounds of the theorem. In the 
experimental version, the matrices tested are nxd, with d < n without loss of generality, and the embedding 
matrices are t x n and dx t 2 (or the original matrix for large t). 

The resulting low-rank approximation was tested for t taking values of the form |1. 4* J , with t < d/4, and 
for each such t, taking the ratio R e of the Frobenius norm of the error with the Frobenius norm of the best 
rank-fc approximation, for all k < t. The resulting points (t/k,R e ) were generated, for all test matrices, 
resulting in a set of points P. (Actually, three instantiations of the low-rank approximations were done, 
generating different embedding matrices, and the results are the middle value of the three.) 

The test matrices are from the University of Florida Sparse Matrix Collection, essentially all those with 
at most 10 5 nonzero entries, and with n at most about 600, because the tests required some slow operations 
and were done on a laptop. However, despite such restrictions, over 500 matrices were tested, taken from 40 
sub-collections, each such sub-collection representing a particular application area. 

The results are shown in two plots. Figure 1 shows the fc-extremal curves of the point set P, for 
k = 1, 3, 5, with y-coordinates that log(i? e ), using the natural logarithm. Here a point (r, p) on a fc-extremal 
curve implies that for all but k matrices, tests run with t/k > r yielded ratios log(i? e ) < p. Although the 
curves are shown for R e > 1, for many matrices R e < 1 for large enough t/k. 

In Figure 2, only the 5-extremal curve is shown, for t/k > 4 (and the y coordinate is R e , not its log). We 
see for t/k > 4, the error ratio for all but 5 matrices is no more than 1.21, and this decreases to less than 
1.01 for t/k about 11. 
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